Recap
Interpretation
Least Squares predictions \(\widehat{Y} = \mathbf{X}\beta\) give the linear approximation of \(E(Y_i | X_i)\) that has the smallest mean-squared error (minimum distance to the true \(E(Y_i | X_i)\)).
Alternatively, least squares can be interpreted as a particular weighted average of the derivatives of the non-linear CEF (board)
What does this mean?
\[\mathbf{X}\beta = \begin{pmatrix} 1 & x_{11} & x_{21} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} \hat{y}_1 \\ \vdots \\ \hat{y}_n \end{pmatrix} = \widehat{Y}\]
Least Squares can predict group means when using binary variables:
Despite different coefficients, least squares can make identical predictions \(\hat{y}\)
We need to specify design matrix/model to get estimates of interest. (e.g. difference in means)
If we have mutually exclusive indicator variables for belonging to distinct categories (sometimes called “dummy variables”), we must drop one group if we fit an intercept. (Why?)
If we fit group means: residuals are deviations from group means (centered at 0 within each group).
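A minimal simulated sketch of both points (all names and numbers here are made up, not from the lecture data): with an intercept plus indicators for all but one group, least squares reproduces the group means, and residuals average to 0 within each group.

```r
set.seed(1)
n <- 300
group <- sample(c("a", "b", "c"), n, replace = TRUE)
y <- c(a = 10, b = 20, c = 30)[group] + rnorm(n)

# lm() drops the "a" indicator automatically: with an intercept,
# keeping all three indicators would be perfectly collinear.
m <- lm(y ~ group)

# Fitted values equal the group means...
fitted(m)[group == "b"][1]   # equals mean(y[group == "b"])
# ...and residuals average to 0 within each group.
tapply(resid(m), group, mean)
```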
So far, we have considered fitting group means.
We want to estimate differences in earnings across gender, but we want to control for the sector of employment.
In this data, we only have doctors and lawyers, so we can estimate the following:
\(Earnings_i = b_0 + b_1 \ Female_i + b_2 \ Medicine_i\)
Where \(Female_i\) is \(1\) if the person is female, \(0\) if male; \(Medicine_i\) is \(1\) if they are a doctor, \(0\) if they are a lawyer.
Let’s now find the (approximate) linear conditional expectation function of earnings, across gender and profession:
##
## Call:
## lm(formula = INCEARN ~ FEMALE + MEDICINE, data = acs_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -226109 -93592 -35873 81891 762891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 176243 2034 86.66 <2e-16 ***
## FEMALE -55370 2824 -19.61 <2e-16 ***
## MEDICINE 55866 2670 20.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 132700 on 9997 degrees of freedom
## Multiple R-squared: 0.07833, Adjusted R-squared: 0.07814
## F-statistic: 424.8 on 2 and 9997 DF, p-value: < 2.2e-16
How do we make sense of, e.g., the slope on FEMALE?
Let’s now find the (approximate) linear conditional expectation function of earnings, across gender and hours worked:
We’re not controlling for a binary indicator, but a continuous variable.
##
## Call:
## lm(formula = INCEARN ~ FEMALE + UHRSWORK, data = acs_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -243715 -98889 -38988 90987 796111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 136821.6 6268.7 21.83 <2e-16 ***
## FEMALE -53694.3 2886.5 -18.60 <2e-16 ***
## UHRSWORK 1241.4 115.3 10.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134800 on 9997 degrees of freedom
## Multiple R-squared: 0.04898, Adjusted R-squared: 0.04879
## F-statistic: 257.4 on 2 and 9997 DF, p-value: < 2.2e-16
How do we make sense of, e.g., the slope on FEMALE?
Gender and age are orthogonal but NOT independent
What could we do if we really wanted to “hold hours worked constant”?
Compare earnings by gender within groups where hours are the same (think about our indicator variables from earlier)?
We can ensure that gender is exactly unrelated to age by fitting a separate intercept for each year of age.
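A simulated sketch of what “an intercept for each age” looks like in R (variable names and effect sizes are made up stand-ins, not the ACS data): wrapping age in factor() fits one intercept per year of age, so the gender coefficient compares only people of identical age.

```r
set.seed(2)
n      <- 1000
female <- rbinom(n, 1, 0.5)
age    <- sample(25:65, n, replace = TRUE)
earn   <- 100000 - 50000 * female + 2000 * age + rnorm(n, sd = 20000)

# factor(age) creates one indicator per year of age (dropping one),
# i.e., a separate intercept for every age group.
m_fe <- lm(earn ~ female + factor(age))
coef(m_fe)["female"]   # close to the true -50000
```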
If the CEF, \(E[Y|X]\), is not linear in \(X\), we can still use least squares to model this relationship.
The easiest choice is to use a polynomial expansion of \(X\).
If a straight-line relationship between \(X\) and \(Y\) is clearly wrong, we can model a “U”-shape by adding a squared term of \(X\):
\(Earnings_i = b_0 + b_1 Female_i + b_2 Hours_i + b_3 Hours_i^2\)
It is “linear” in that we still multiply values of \(X\) by \(\beta\) and sum, but we use non-linear transformations of the data.
Can we do better?
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.546584e+05 8.532942e+04 -6.500201 8.404065e-11
## FEMALE -4.987018e+04 2.870964e+03 -17.370536 1.305089e-66
## poly(UHRSWORK, 3, raw = T)1 3.302792e+04 4.423739e+03 7.466063 8.952876e-14
## poly(UHRSWORK, 3, raw = T)2 -4.500185e+02 7.373786e+01 -6.102950 1.079941e-09
## poly(UHRSWORK, 3, raw = T)3 1.934612e+00 3.946976e-01 4.901505 9.660031e-07
We can incorporate all kinds of non-linearity:
But there are trade-offs…
Linear fit
Perfect Polynomial fit…
Perfect Polynomial fit?
We risk overfitting the data, leading to very bad extrapolations/interpolations.
See Aronow and Miller, Chapter 4.3.4
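A simulated illustration of the overfitting risk (all numbers made up): with 10 points, a degree-9 polynomial fits the sample “perfectly”, while the straight line, the true model here up to noise, extrapolates sensibly.

```r
set.seed(4)
x <- 1:10
y <- x + rnorm(10, sd = 0.5)   # truth: a straight line

m_line <- lm(y ~ x)
m_poly <- lm(y ~ poly(x, 9))   # 10 coefficients for 10 points

sum(resid(m_poly)^2)           # essentially 0: a "perfect" in-sample fit
# But extrapolating just past the data:
predict(m_line, newdata = data.frame(x = 12))  # near 12, sensible
predict(m_poly, newdata = data.frame(x = 12))  # typically far off (overfitting)
```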
Basically, if you non-linearly transform \(Y\) or \(X\), you need to check how to interpret the coefficients.
Recall:
When there is possible confounding, we want to “block” these causal paths using conditioning
In order for conditioning to estimate the \(ACE\) without bias, we must assume
1. Ignorability/Conditional Independence: within strata of \(X\), potential outcomes of \(Y\) must be independent of cause \(D\) (i.e., within values of \(X\), \(D\) must be as-if random)
In order for conditioning to estimate the \(ACE\) without bias, we must assume
2. Positivity/Common Support: for all values of treatment \(d\) in \(D\) and all values of \(x\) in \(X\): \(0 < Pr(D = d | X = x) < 1\)
We find effect of \(D\) on \(Y\) within each subset of the data uniquely defined by values of \(X_i\).
\(\widehat{ACE}[X = x] = E[Y(1) | D=1, X = x] - E[Y(0) | D=0, X = x]\)
for each value of \(x\) in the data.
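A simulated sketch of this stratified estimator (the confounder, treatment rule, and effect size are all made up): within each stratum of \(X\) we take the difference in observed means, then average across strata.

```r
set.seed(5)
n <- 2000
x <- sample(0:2, n, replace = TRUE)       # discrete confounder
d <- rbinom(n, 1, plogis(x - 1))          # treatment depends on x
y <- 1 + 2 * d + 3 * x + rnorm(n)         # true effect of d is 2

# Difference in means within each stratum of x...
strata <- split(seq_len(n), x)
ace_x  <- sapply(strata, function(i) mean(y[i][d[i] == 1]) - mean(y[i][d[i] == 0]))
# ...averaged with weights proportional to stratum size:
w <- sapply(strata, length) / n
sum(w * ace_x)   # close to the true effect of 2
```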
Under the Conditional Independence Assumption, \(E[Y(1) | D=1, X=x] = \color{red}{E[Y(1) | D=0, X=x]}\) and \(E[Y(0) | D=0, X=x] = \color{red}{E[Y(0) | D=1, X=x]}\)
We impute the missing potential outcomes… with the expected value of the outcome of observed cases with same values of \(x\), different values of \(D\).
What have we seen that looks like these values: e.g., \(E[Y(0) | D=0, X=x]\)?
To understand how we can use regression to estimate causal effects, we need to describe the kinds of causal estimands we might be interested in.
In the context of experiments, each observation has potential outcomes corresponding to their behavior under different treatments
In regression, where levels of treatment might be continuous, we generalize this idea to the “response schedule”:
average causal response function:
average partial derivative:
We can use regression to estimate the linear approximation of the average causal response function:
\[Y_i(D_i = d) = \beta_0 + \beta_1 D_i + \epsilon_i\]
Here \(Y_i(D_i = d)\) is the potential outcome of case \(i\) for a value of \(D = d\).
If we don’t know parameters \(\beta_0, \beta_1\), what do we need to assume to obtain an estimate \(\widehat{\beta}_1\) that we can give a causal interpretation? (On average, change in \(D\) causes \(\widehat{\beta}_1\) change in \(Y\))
We must assume \(D_i\) is independent of \(\epsilon_i\) (i.e., no confounding).
In this scenario, if \(D\) were binary and we had randomization, this is equivalent to estimating the \(ACE\) for an experiment.
If we want to use regression for conditioning, then the model would look different:
\[Y_i(D_i = d, X_i = x) = \beta_0 + \beta_1 D_i + \mathbf{X_i\beta_X} + \epsilon_i\]
Given what we have learned about regression so far…
How it works:
Imagine we want to know the efficacy of UN Peacekeeping operations (Doyle & Sambanis 2000) after civil wars:
We can compare the post-conflict outcomes of countries with and without UN Peacekeeping operations.
To address concern about confounding, we condition on war type (non/ethnic), war deaths, war duration, number of factions, economic assistance, energy consumption, natural resource dependence, and whether the civil war ended in a treaty.
122 conflicts… can we find exact matches?
Without perfect matches on possible confounders, we don’t have cases without a Peacekeeping operation that we can use to substitute for the counterfactual outcome in conflicts with a Peacekeeping force.
We can use regression to linearly approximate the conditional expectation function \(E[Y(d) | D = d, X = x]\) to plug in the missing values.
\[Y_i = \beta_0 + \beta_D \cdot D_i + \mathbf{X_i\beta_X} + \epsilon_i\]
Copy the data, ds2000, from here: https://pastebin.com/bAcmrPdN
- Regress success on untype4, logcost, wardur, factnum, trnsfcap, treaty, develop, exp, decade, using lm(); save this as m.
- Copy ds2000 and flip the value of untype4 (0 to 1, 1 to 0).
- Use predict(m, newdata = ds2000_copy) to add a new column called y_hat to ds2000.

Next:

- Create y1, which equals success for cases with untype4 == 1, and y_hat for cases with untype4 == 0.
- Create y0, which equals success for cases with untype4 == 0, and y_hat for cases with untype4 == 1.
- Create tau_i as the difference between y1 and y0. Then calculate the mean tau_i.
- Compare this to the coefficient on untype4 in your regression results.

m = lm(success ~ untype4 + treaty + wartype + decade +
factnum + logcost + wardur + trnsfcap + develop + exp,
data = ds2000)
cf_ds2000 = ds2000 %>% as.data.frame
cf_ds2000$untype4 = 1*!(cf_ds2000$untype4)
ds2000[, y_hat := predict(m, newdata = cf_ds2000)]
ds2000[, y1 := ifelse(untype4 %in% 1, success, y_hat)]
ds2000[, y0 := ifelse(untype4 %in% 0, success, y_hat)]
ds2000[, tau := y1 - y0]
ds2000[, tau] %>% mean
## [1] 0.4394185
| Model 1 | |
|---|---|
| (Intercept) | 1.544*** (0.201) |
| untype4 | 0.439* (0.169) |
| treaty | 0.302** (0.093) |
| wartype | −0.235** (0.075) |
| decade | −0.035 (0.026) |
| factnum | −0.070** (0.026) |
| logcost | −0.065*** (0.016) |
| wardur | 0.001 (0.000) |
| trnsfcap | 0.000 (0.000) |
| develop | 0.000 (0.000) |
| exp | −0.981* (0.440) |
| Num.Obs. | 122 |
| R2 | 0.428 |
| RMSE | 0.36 |
| \(i\) | \(UN_i\) | \(Y_i(1)\) | \(Y_i(0)\) | \(\tau_i\) |
|---|---|---|---|---|
| 111 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 0.00 | \(\color{red}{?}\) |
| 112 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 0.00 | \(\color{red}{?}\) |
| 113 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 0.00 | \(\color{red}{?}\) |
| 114 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 1.00 | \(\color{red}{?}\) |
| 115 | 0 | \(\color{red}{E[Y(1) | D = 0, X = x]}\) | 1.00 | \(\color{red}{?}\) |
| 116 | 1 | 0.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
| 117 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
| 118 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
| 119 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
| 120 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
| 121 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
| 122 | 1 | 1.00 | \(\color{red}{E[Y(0) | D = 1, X = x]}\) | \(\color{red}{?}\) |
| \(i\) | \(UN_i\) | \(Y_i(1)\) | \(Y_i(0)\) | \(\tau_i\) |
|---|---|---|---|---|
| 111 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 0.00 | \(\color{red}{?}\) |
| 112 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 0.00 | \(\color{red}{?}\) |
| 113 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 0.00 | \(\color{red}{?}\) |
| 114 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 1.00 | \(\color{red}{?}\) |
| 115 | 0 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 1 + \mathbf{X_i\widehat{\beta_{X}}}}\) | 1.00 | \(\color{red}{?}\) |
| 116 | 1 | 0.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
| 117 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
| 118 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
| 119 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
| 120 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
| 121 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
| 122 | 1 | 1.00 | \(\color{red}{\widehat{\beta_0} + \widehat{\beta_{D}} \cdot 0 + \mathbf{X_i\widehat{\beta_{X}}}}\) | \(\color{red}{?}\) |
| \(i\) | \(UN_i\) | \(Y_i(1)\) | \(Y_i(0)\) | \(\tau_i\) |
|---|---|---|---|---|
| 111 | 0 | 0.70 | 0.00 | \(\color{red}{0.70}\) |
| 112 | 0 | 0.68 | 0.00 | \(\color{red}{0.68}\) |
| 113 | 0 | 0.68 | 0.00 | \(\color{red}{0.68}\) |
| 114 | 0 | 0.87 | 1.00 | \(\color{red}{-0.13}\) |
| 115 | 0 | 0.97 | 1.00 | \(\color{red}{-0.03}\) |
| 116 | 1 | 0.00 | 0.10 | \(\color{red}{-0.10}\) |
| 117 | 1 | 1.00 | 0.41 | \(\color{red}{0.59}\) |
| 118 | 1 | 1.00 | 0.34 | \(\color{red}{0.66}\) |
| 119 | 1 | 1.00 | 0.54 | \(\color{red}{0.46}\) |
| 120 | 1 | 1.00 | 0.73 | \(\color{red}{0.27}\) |
| 121 | 1 | 1.00 | 0.33 | \(\color{red}{0.67}\) |
| 122 | 1 | 1.00 | 0.47 | \(\color{red}{0.53}\) |
How it works:
Assumptions
If the true process generating the data is:
\[Y_i = \beta_0 + \beta_1 D_i + \beta_2 X_i + \nu_i\]
with \((D_i,X_i) \perp \!\!\! \perp \nu_i\), \(E(\nu_i) = 0\)
What happens when we estimate this model with a constant and \(D_i\) but exclude \(X_i\)?
\[Y_i = \beta_0 + \beta_1 D_i + \epsilon_i\]
\[\small\begin{eqnarray} \widehat{\beta_1} &=& \frac{Cov(D_i, Y_i)}{Var(D_i)} \\ &=& \frac{Cov(D_i, \beta_0 + \beta_1 D_i + \beta_2 X_i + \nu_i)}{Var(D_i)} \\ &=& \frac{Cov(D_i, \beta_1 D_i)}{Var(D_i)} + \frac{Cov(D_i,\beta_2 X_i)}{Var(D_i)} + \frac{Cov(D_i,\nu_i)}{Var(D_i)} \\ &=& \beta_1\frac{Var(D_i)}{Var(D_i)} + \beta_2\frac{Cov(D_i, X_i)}{Var(D_i)} \\ &=& \beta_1 + \beta_2\frac{Cov(D_i, X_i)}{Var(D_i)} \end{eqnarray}\]
So \(E(\widehat{\beta_1}) \neq \beta_1\): the estimator is biased
When we exclude \(X_i\) from the regression, we get:
\[\widehat{\beta_1} = \beta_1 + \beta_2\frac{Cov(D_i, X_i)}{Var(D_i)}\]
This is omitted variable bias
Excluding \(X\) from the model: \(\widehat{\beta_1} = \beta_1 + \beta_2\frac{Cov(D_i, X_i)}{Var(D_i)}\)
What is the direction of the bias when:
\(\beta_2 > 0\); \(\frac{Cov(D_i, X_i)}{Var(D_i)} < 0\)
\(\beta_2 < 0\); \(\frac{Cov(D_i, X_i)}{Var(D_i)} < 0\)
\(\beta_2 > 0\); \(\frac{Cov(D_i, X_i)}{Var(D_i)} > 0\)
\(\beta_2 = 0\); \(\frac{Cov(D_i, X_i)}{Var(D_i)} > 0\)
\(\beta_2 > 0\); \(\frac{Cov(D_i, X_i)}{Var(D_i)} = 0\)
This only yields bias if two conditions are true:
\(\beta_2 \neq 0\): omitted variable \(X\) has an effect on \(Y\)
\(\frac{Cov(D_i, X_i)}{Var(D_i)} \neq 0\): omitted variable \(X\) is correlated with \(D\). (on the same backdoor path)
This is why we don’t need to include EVERYTHING that might affect \(Y\) in our regression equation; only those variables that affect both the treatment and the outcome.
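A simulation sketch of the bias formula (parameters made up): regressing \(Y\) on \(D\) alone recovers \(\beta_1 + \beta_2\,Cov(D_i, X_i)/Var(D_i)\), not \(\beta_1\).

```r
set.seed(6)
n <- 1e5
x <- rnorm(n)
d <- 0.5 * x + rnorm(n)              # D and X correlated (same backdoor path)
y <- 1 + 2 * d + 3 * x + rnorm(n)    # beta1 = 2, beta2 = 3

b1_short   <- coef(lm(y ~ d))["d"]   # "short" regression omitting x
b1_formula <- 2 + 3 * cov(d, x) / var(d)

c(b1_short, b1_formula)              # nearly identical, and both far from 2
```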
Link to DAGs:
Link to Conditional Independence:
Link to linearity:
Let’s turn to a different example: do hours increase earnings?
Let’s say that we estimate this model, and we have included all possible confounders.
\[\begin{eqnarray}Y_i = \beta_0 + \beta_1 Hours_i + \beta_2 Female_i + \\ \beta_3 Age_i + \beta_4 Law_i + \epsilon_i\end{eqnarray}\]
And we want to estimate \(\beta_1\) with \(\widehat{\beta_1}\)
If we are imputing missing potential outcomes of earnings for different hours worked using this model…
We need to ask…
Assuming additivity and linearity:
| Linear/Additive | |
|---|---|
| Hours Worked | 1076*** (115) |
| Male | 37223*** (2800) |
| Age (Years) | 3453*** (125) |
| Law | −48920*** (2675) |
| Num.Obs. | 10000 |
| R2 | 0.146 |
| RMSE | 127723.58 |
Non-linear dependence between D and X, despite regression
Assuming linearity, not additivity: linear relationship between Age and hours/earnings varies by gender and profession
| Linear/Additive | Linear/Interactive | |
|---|---|---|
| Hours Worked | 1076*** (115) | 1175*** (114) |
| Male | 37223*** (2800) | 51887** (18118) |
| Age (Years) | 3453*** (125) | 4985*** (341) |
| Law | −48920*** (2675) | 105972*** (19357) |
| Num.Obs. | 10000 | 10000 |
| R2 | 0.146 | 0.158 |
| RMSE | 127723.58 | 126847.51 |
Non-linear dependence between D and X, despite regression
Assuming neither linearity nor additivity: fit an intercept for every combination of gender, profession, and age in years. (Technically still linear in the coefficients, but no functional-form assumption.)
| Linear/Additive | Linear/Interactive | Nonlinear/Interactive | |
|---|---|---|---|
| Hours Worked | 1076*** (115) | 1175*** (114) | 1305*** (113) |
| Male | 37223*** (2800) | 51887** (18118) | |
| Age (Years) | 3453*** (125) | 4985*** (341) | |
| Law | −48920*** (2675) | 105972*** (19357) | −46702*** (2614) |
| Num.Obs. | 10000 | 10000 | 10000 |
| R2 | 0.146 | 0.158 | 0.199 |
| RMSE | 127723.58 | 126847.51 | 123733.07 |
No dependence between \(D\) and \(X\), after regression
Even if we included all variables on the backdoor paths between \(D\) and \(Y\), regression may still produce a biased estimate: